Working with DataFrames

Both SimpleSDMLayers.jl and GBIF.jl offer an optional integration with the DataFrames.jl package. Therefore, our previous example with the kingfisher Megaceryle alcyon could also be approached with a DataFrame-centered workflow.

We will illustrate this using the same data and producing the same figures as in the previous example. To do so, we will use GBIF.jl to produce the occurrence DataFrame we will use throughout this example. However, it is also possible to use a DataFrame of your choosing instead of one generated by GBIF.jl, as long as it holds one occurrence per row, a column with the latitude coordinates, and a column with longitude coordinates. For the rest, it can hold whatever information you like. Most of our functions assume by default that the coordinates are stored in columns named :latitude and :longitude (the order doesn't matter), but you can generally specify other names with latitude = :lat in case you don't want to rename them (we will show you how below).

So let's start by getting our data:

# Load packages
using SimpleSDMLayers
using GBIF
using Plots
using Statistics
# Load DataFrames too
using DataFrames

# Load environmental data
temperature, precipitation = SimpleSDMPredictor(WorldClim{BioClim}, [1,12])

# Get GBIF occurrences
kingfisher = GBIF.taxon("Megaceryle alcyon", strict=true)
kf_occurrences = occurrences(kingfisher,
                             "hasCoordinate" => "true",
                             "decimalLatitude" => (0.0, 65.0),
                             "decimalLongitude" => (-180.0, -50.0),
                             "limit" => 200)
for i in 1:4
  occurrences!(kf_occurrences)
end
@info kf_occurrences
[ Info: DataFrames integration loaded
[ Info: Loading DataFrames support for GBIF.jl
Starting download for https://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_10m_bio.zip
┌ Info: Downloading
│   source = "https://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_10m_bio.zip"
│   dest = "rasterdata/WorldClim/BioClim/zips/wc2.1_10m_bio.zip"
│   progress = 1.0
│   time_taken = "0.03 s"
│   time_remaining = "0.0 s"
│   average_speed = "11.551 KiB/s"
│   downloaded = "343 bytes"
│   remaining = "0 bytes"
└   total = "343 bytes"
┌ Info: Downloading
│   source = "https://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_10m_bio.zip"
│   dest = "rasterdata/WorldClim/BioClim/zips/wc2.1_10m_bio.zip"
│   progress = 0.4912
│   time_taken = "1.0 s"
│   time_remaining = "1.04 s"
│   average_speed = "23.336 MiB/s"
│   downloaded = "23.359 MiB"
│   remaining = "24.200 MiB"
└   total = "47.559 MiB"
┌ Info: Downloading
│   source = "https://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_10m_bio.zip"
│   dest = "rasterdata/WorldClim/BioClim/zips/wc2.1_10m_bio.zip"
│   progress = 1.0
│   time_taken = "2.07 s"
│   time_remaining = "0.0 s"
│   average_speed = "23.009 MiB/s"
│   downloaded = "47.559 MiB"
│   remaining = "0 bytes"
└   total = "47.559 MiB"
[ Info: GBIF records: downloaded 1000 out of 100000

Once the data is loaded, we can easily convert the environmental layers to a DataFrame with the corresponding coordinates. We can do this for a single layer or for multiple layers at the same time:

# Single layer
temperature_df = DataFrame(temperature)
# Multiple layers
env_layers = [temperature, precipitation]
env_df = DataFrame(env_layers)
rename!(env_df, :x1 => :temperature, :x2 => :precipitation)
first(env_df, 5)

5 rows × 4 columns

longitudelatitudetemperatureprecipitation
Float64Float64Union…Union…
1-179.917-89.9167-31.017143.0
2-179.917-89.75-30.391940.0
3-179.917-89.5833-33.482240.0
4-179.917-89.4167-33.610437.0
5-179.917-89.25-33.719940.0

Note that the resulting DataFrame will include the values set to nothing in the layers. We might want to remove those rows using filter!:

filter!(x -> !isnothing(x.temperature) && !isnothing(x.precipitation), env_df);
last(env_df, 5)

5 rows × 4 columns

longitudelatitudetemperatureprecipitation
Float64Float64Union…Union…
1179.91770.9167-11.1038153.0
2179.91771.0833-12.7957161.0
3179.91771.25-12.8151148.0
4179.91771.4167-12.3703136.0
5179.91771.5833-12.3328122.0

GBIF.jl allows us to convert a set of occurrences to a DataFrame just as easily:

kf_df = DataFrame(kf_occurrences)
last(kf_df, 5)

5 rows × 18 columns

keynamedatasetpublished_inobserved_inlatitudelongitudedaterankobserverlicensekingdomphylumclassorderfamilygenusspecies
Int64Abstrac…?Abstrac…?Abstrac…?Abstrac…?Abstrac…?Abstrac…?DateTim…?Abstrac…?Abstrac…?Abstrac…?String?String?String?String?String?String?String?
12563550763Megaceryle alcyoniNaturalist research-grade observationsUSUS39.5994-105.0242020-01-21T09:46:00SPECIESibalinehttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
22563550855Megaceryle alcyoniNaturalist research-grade observationsUSUS36.9703-122.0232020-01-21T16:36:00SPECIESJames Maughnhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
32563554552Megaceryle alcyoniNaturalist research-grade observationsUSUS27.961-81.1932020-01-16T10:49:00SPECIESEdward Perry IVhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
42563556487Megaceryle alcyoniNaturalist research-grade observationsUSUS37.8451-122.2292020-01-21T09:41:00SPECIESssherbrookehttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon
52563558060Megaceryle alcyoniNaturalist research-grade observationsUSUS30.1145-84.122020-01-22T07:25:00SPECIESDon Morrowhttp://creativecommons.org/licenses/by-nc/4.0/legalcodeAnimaliaChordataAvesCoraciiformesAlcedinidaeMegaceryleMegaceryle alcyon

We can then extract the temperature values for all the occurrences.

temperature[kf_df]
1000-element Array{Float32,1}:
 18.931166
 18.931166
 17.857887
 17.08904
  9.419209
 12.142196
 10.534634
 12.142196
 24.394215
 14.874448
  ⋮
 18.522083
 19.516615
 22.006813
 15.539794
  9.653844
 13.974606
 22.402313
 14.676945
 19.857368

Or we can clip the layers according to the occurrences:

temperature_clip = clip(temperature, kf_df)
precipitation_clip = clip(precipitation, kf_df)
SDM predictor → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

In case your DataFrame has different column names for the coordinates, for example :lat and :lon, you can clip it like this:

kf_df_shortnames = rename(kf_df, :latitude => :lat, :longitude => :lon)
clip(temperature, kf_df_shortnames; latitude = :lat, longitude = :lon)
SDM predictor → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

We can finally plot the layer and occurrence values in a similar way to any DataFrame or Array.

histogram2d(temperature_clip, precipitation_clip, c = :viridis)
scatter!(temperature_clip[kf_df], precipitation_clip[kf_df],
         lab= "", c = :white, msc = :orange)

To plot the occurrence values over space, you can use:

contour(temperature_clip, c = :alpine, title = "Temperature",
        frame = :box, fill = true)
scatter!(kf_df.longitude, kf_df.latitude,
         lab = "", c = :white, msc = :orange, ms = 2)

We can finally make a layer with the number of observations per cells:

abundance = mask(precipitation_clip, kf_occurrences, Float32)
SDM response → 336×738 grid with 81323 Float32-valued cells
Latitudes	(5.916666666666657, 61.750000000000014)
Longitudes	(-172.58333333333331, -49.749999999999986)

A useful trick to visualize sites with occurrences, in contrast with sites without any occurrence, is to use replace or replace! to set the values returned as 0 or true by the function mask() to nothing. This allows us to first plot a background layer with a uniform colour, covering the whole area to visualize, then plot the occurrence layer on top using a different colour scale.

abundance_nozeros = replace(abundance, 0 => nothing)
plot(precipitation_clip, c = :lightgrey)
plot!(abundance_nozeros, c = :viridis, clim = extrema(abundance_nozeros))

Once again, the cells are rather small, and there are few observations, so this is not necessarily going to be very informative. As in our other example, to get a better sense of the distribution of observations, we can get the average number of observations in a radius of 100km around each cell (we will do so for a zoomed-in part of the map to save time):

zoom = abundance[left = -100.0, right = -75.0, top = 43.0, bottom = 20.0]
buffered = slidingwindow(zoom, Statistics.mean, 100.0)
plot(buffered, c = :lapaz, legend = false, frame = :box)
scatter!(kf_df.longitude, kf_df.latitude,
         lab = "", c = :white, msc = :orange, ms = 2, alpha = 0.5)